We are not going to dive deep into an introduction to deep learning, because this tutorial has a different agenda. Instead I will provide some links for you, and let's also mention what Wikipedia says about deep neural networks:
https://en.wikipedia.org/wiki/Deep_learning
"A deep neural network (DNN) is an ANN (artificial neural network) with multiple hidden layers between the input and output layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network."
Here are some links to get you up and running with neural networks:
https://youtu.be/1L0TKZQcUtA - fantastic set of lectures by Lex Fridman (MIT). These will give you a high level overview of Deep Learning.
http://playground.tensorflow.org - interactive neural network constructor. Best way to play around with neural networks for a beginner
https://youtu.be/aircAruvnKk - visual introduction to neural networks by 3Blue1Brown
https://youtu.be/uXt8qF2Zzfo - great technical introduction to neural networks by Patrick Winston (MIT)
https://youtu.be/VrMHA3yX_QI - another great lecture by Prof. Patrick Winston called "Deep Neural Nets"
https://youtu.be/ILsA4nyG7I0 - How Deep Neural Networks Work. Title speaks for itself. Introduction to deep neural networks by Brandon Rohrer (Facebook)
https://youtu.be/rEDzUT3ymw4 - pretty good 1 minute intro to neural networks
https://youtu.be/bxe2T-V8XRs - technical but useful introduction to neural networks by Welch Labs
After you get the idea of what neural networks are:
https://www.coursera.org/learn/calculus1 - this course will give you all the calculus knowledge required. You will easily understand derivatives, chain rule, gradients and backpropagation used in neural networks. Took this course myself. 5/5 stars!
https://www.deeplearning.ai/ - this course by Andrew Ng is the best. It has everything you need to get good at deep learning. I took it myself and I highly recommend it!
http://www.mathtutordvd.com/products/The-Calculus-3-Tutor-Volume-1.cfm - here is another fantastic course on calculus by Jason Gibson. After watching it, your understanding of the chain rule and gradients in the context of neural networks will be in top shape.
After (really) getting familiar with neural networks and understanding how they work (including the linear algebra, calculus and probability theory involved), you can move on and try this book: http://www.deeplearningbook.org/ It is highly technical and hard to read; it will probably take a beginner a year or so. But it's worth it if you want to get good in this field, particularly if you want to do some research. (Btw, it should be manageable for a beginner who already has a good mathematical background in linear algebra, statistics and calculus.)
We need the following libraries to plot our data. We will use matplotlib, seaborn and plotly to make the plots.
import numpy as np
import math
import pandas as pd
from scipy.special import expit
import matplotlib
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import seaborn as sns
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)
The code you see below is taken from the playground.tensorflow.org github repository, which is located here: https://github.com/tensorflow/playground and then adapted to our needs.
Basically, the getCircleLabel() function is moved out of the getCircleData() function to make it callable globally (we will need it later in our code). We also "suppress" the outer circle's (blue data points) variance to keep our experiment simple and clean.
We also set the random seed to a fixed value to generate the same data when running our functions again and again.
If you want to change the data point count, just call getTrainingData(data_points_count = 1000) and set how many points you want using the "data_points_count" argument. Note however that the more data points there are, the harder it is for the random search to come up with appropriate weights and biases to classify the orange and blue data points (see code below).
* A little stop, since we will use this term a lot: what is a data point?
How can a row of features (for example Name, Age, Sex etc.) be called simply a "data point"? This may not be obvious.
But if you think about a set of features (x1, x2, ..., xn), it can be represented as an actual point in n-dimensional space.
For example, on a 2D plane of work experience (x axis) and salary (y axis), a specific work experience and salary can be shown as a dot/point, for example with coordinates (10, 70000): a $70K salary for 10 years of experience.
So the term "data point" from the ML/statistics vocabulary fits perfectly here.
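To make the term concrete, here is a tiny sketch (the feature values are made up for illustration) showing that a row of features is literally just a point in n-dimensional space:

```python
import numpy as np

# A "data point" is just a feature vector interpreted as coordinates.
# Hypothetical example: work experience in years (x axis) and salary in $ (y axis).
point = np.array([10, 70_000])  # 10 years of experience, $70K salary

# With n features, a data point lives in n-dimensional space:
point_3d = np.array([10, 70_000, 35])  # adding a third made-up feature, e.g. age

print(point.shape)     # (2,) - a point on a 2D plane
print(point_3d.shape)  # (3,) - a point in 3D space
```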
# this code is based on https://github.com/tensorflow/playground/blob/master/src/dataset.ts
class Example2D:
    def __init__(self, x, y, label):
        self.x = x
        self.y = y
        self.label = label

    def __repr__(self):
        return "x: {}, y: {}, label: {}".format(self.x, self.y, self.label)

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
"""
This function is used to get (numeric) label for the given point.
There are 2 classes (blue and orange) we have.
0 stands for blue and 1 stands for orange,
so 2 labels are possible: 0 and 1 (depending on how far from the center of the circle point is)
Args:
p: a point (see class Point).
center: a point which specifies coordinates of a circle's center.
radius: radius of the circle.
Returns:
1 if point belongs to class 1 (orange), 0 otherwise (blue point).
"""
def getCircleLabel(p, center, radius):
return 1 if (dist(p, center) < (radius * 0.5)) else 0
def getCircleData(numSamples, noise):
    points = []
    radius = 5
    # Generate points inside the circle (orange points, label=1).
    for i in range(0, int(numSamples / 2)):
        r = np.random.uniform(0, radius * 0.5)
        angle = np.random.uniform(0, 2 * math.pi)
        x = r * math.sin(angle)
        y = r * math.cos(angle)
        noiseX = np.random.uniform(-radius, radius) * noise
        noiseY = np.random.uniform(-radius, radius) * noise
        label = getCircleLabel(Point(x=x + noiseX, y=y + noiseY), Point(x=0, y=0), radius)
        points.append(Example2D(x, y, label))
    # Generate points outside the circle (blue points, label=0).
    for i in range(0, int(numSamples / 2)):
        # r = np.random.uniform(radius * 0.7, radius)  # original playground code (outer circle with some variance)
        r = radius  # modifying playground code to get a clean outer circle (without noise/variance)
        angle = np.random.uniform(0, 2 * math.pi)
        x = r * math.sin(angle)
        y = r * math.cos(angle)
        noiseX = np.random.uniform(-radius, radius) * noise
        noiseY = np.random.uniform(-radius, radius) * noise
        label = getCircleLabel(Point(x=x + noiseX, y=y + noiseY), Point(x=0, y=0), radius)
        points.append(Example2D(x, y, label))
    return points
def dist(a: Point, b: Point):
    dx = a.x - b.x
    dy = a.y - b.y
    return math.sqrt(dx * dx + dy * dy)
def getTrainingData(data_points_count = 1000):
    np.random.seed(42)
    noise = 0
    points = getCircleData(numSamples=data_points_count, noise=noise)
    x1 = [point.x for point in points]  # put the x coordinate of each point into the x1 list
    x2 = [point.y for point in points]  # put the y coordinate of each point into the x2 list
    x1 = np.array(x1)
    x2 = np.array(x2)
    label = [point.label for point in points]  # store the label/class (0=blue data point; 1=orange) of each point in the label list
    return x1, x2, label  # x1 = x coordinate, x2 = y coordinate, label = class of a point (0 or 1)
# get x coordinate of each point, y coordinate of each point and label for each point (blue=0, orange=1)
x1, x2, label = getTrainingData()
This is how our data looks visually:
df = pd.DataFrame({'x1': x1, 'x2': x2, 'label': label})
sns.lmplot(x="x1", y="x2", hue="label", data=df, fit_reg=False)
As you can see above, all blue data points lie on the outer circle, while the orange data points are concentrated near the center of the circle. Blue data points are labeled as 0 and orange data points are labeled as 1. Each data point consists of 2 values: x1 (x axis on the plot) and x2 (y axis on the plot). We can also say that each data point consists of 2 features: x1 and x2.
Here we will create a simple neural network which consists of an input layer (layer #0), 1 hidden layer (layer #1) and an output layer:
INPUT LAYER of x1,x2 features -> HIDDEN LAYER of 3 neurons -> OUTPUT LAYER of 1 neuron
For now we are setting the so called weights and biases to some random/arbitrary values. Currently we just want to get started, so we are only interested in creating the basic architecture for our needs. (Note also that we are not using any kind of error function at the end of the network, so backpropagation is not used at all for now. We only want to create a model (a simple neural network).)
Here is a visual representation of a network we are going to build:

# INPUT LAYER STRUCTURE:
# input layer is just a vector of x1 and x2
x1, x2, label = getTrainingData()
# HIDDEN LAYER STRUCTURE:
# neuron 1 of hidden layer 1
b1 = 0
w1 = [-0.4, 0.6] # so called "weights" are just multipliers in front of each feature (see line right below)
z1 = w1[0] * x1 + w1[1] * x2 + b1 # pre-activation block of a neuron (note also that we add so called "bias" which we will explain later)
a1 = expit(z1) # activation block of a neuron (expit is actually a sigmoid function)
# neuron 2 of hidden layer 1
b2 = 0
w2 = [0.2, 0.8]
# let's use this notation (np.dot) instead of w2[0] * x1 + w2[1] * x2 + b2
# np.dot computes and returns dot product of two arrays (which simply is a sum of weights-multiplied-by-features)
# it is used all over the place in neural networks
# imagine if we had much more weights..
# In this case it is easier and faster to use dot product
# as you can see we pass only the w2 vector instead of its components like w2[0], w2[1] etc.
# doc: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dot.html
# take a look at this also: https://youtu.be/tKcLaGdvabM - Broadcasting in Python (C1W2L15)
# also simply google "what is dot product" or "dot product explained"
z2 = np.dot(w2, [x1, x2]) + b2
a2 = expit(z2)
# neuron 3 of hidden layer 1
b3 = 0
w3 = [0.7, 0.8]
z3 = np.dot(w3, [x1, x2]) + b3
a3 = expit(z3)
# OUTPUT LAYER STRUCTURE:
# neuron of the output layer
b_out = 0
w_out = [-1.2, 2.0, 0.1]
z_out = np.dot(w_out, [a1, a2, a3]) + b_out
a_out = expit(z_out)
# HELPFUL STUFF NOT RELATED TO NEURAL NETWORK
# predictions of our network for each data point
prediction = np.where(a_out >= 0.5, 1, 0)
# we will use this "original" sum of x1 and x2 to see how data is changing as it travels through different neurons
x1x2_sum = x1 + x2
# creating pandas dataframe here which we will need to make plots
df = pd.DataFrame({'x1': x1, 'x2': x2, 'label': label})
df['x1x2_sum'] = x1x2_sum
df['z1'] = z1
df['a1'] = a1
df['z2'] = z2
df['a2'] = a2
df['z3'] = z3
df['a3'] = a3
df['z_out'] = z_out
df['a_out'] = expit(z_out)
df['prediction'] = prediction
# let's see some of the misclassified data points
display(df[(df.label == 0) & (df.prediction == 1)].head(5))
Visual representation of our (untrained yet) neural network
sns.lmplot(x="x1", y="x2", data=df, hue="label", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("fig. 1, original data points (x1, x2)")
sns.lmplot(x="z1", y="a1", data=df, hue="label", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("fig. 2, [hidden layer #1, neuron #1] preactivation z1 is compared to activation a1")
sns.lmplot(x="z2", y="a2", data=df, hue="label", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("fig. 3, [hidden layer #1, neuron #2] preactivation z2 is compared to activation a2")
sns.lmplot(x="z3", y="a3", data=df, hue="label", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("fig. 4, [hidden layer #1, neuron #3] preactivation z3 is compared to activation a3")
sns.lmplot(x="z_out", y="a_out", data=df, hue="label", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("fig. 5, [output layer, neuron #1] preactivation z_out is compared to activation a_out")
What you see above is a visual summary of what happens as input data travels through our neural network. Currently the network is UNTRAINED (which means it classifies our data points chaotically), so it is not very useful yet. But still, at least now we have a neural network to work with, and it is good to see what the data transformation looks like as it goes through different neurons.
In the 1st plot above, for example, we have the original data points (pairs of x1, x2) plotted. As expected, this is a circle with a high concentration of orange points in the middle and blue points forming the outline of the circle.
In all remaining plots (fig. 2 - 5) we see what happens to the data in each neuron as the pre-activation output is transformed by the activation segment of the neuron.
For now you can think of fig. 2 to 5 as the circle being transformed into a curvy line specific to the activation at the corresponding neuron. Each curvy line looks more or less like an "S" shape thanks to the so called sigmoid function used during activation.
We will need to understand sigmoid really really well, because it will reveal to us how deep neural network works. So let's get started!
Sigmoid function has the following equation:
\begin{align} \sigma(z) & = \frac{1}{1+e^{-z}} \end{align}
In our case the "z" argument to the sigmoid function is always calculated by multiplying the neuron inputs by the corresponding weights and adding a bias, like this: z = weights * inputs + bias. For example for neuron 2 of our hidden layer it is:
z2 = np.dot(w2, [x1, x2]) + b2
We are conveniently using the "expit" function from the scipy library, which accepts not just a single number but a whole array of values.
For now, let's feed a simple sigmoid with some evenly spaced numbers over a specified interval to explore its behavior:
arr = np.arange(-10,10,0.01) # generate numbers from -10 to 10 with step=0.01
# let's pretty-print these numbers
np.set_printoptions(suppress=True) # to avoid scientific notation
print(arr[0:10]) # print first 10 numbers
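As a quick sanity check of the formula above, we can implement the sigmoid by hand and compare it to scipy's expit (the helper name sigmoid below is ours, not part of scipy):

```python
import numpy as np
from scipy.special import expit

def sigmoid(z):
    # 1 / (1 + e^(-z)), exactly the formula above
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))                       # same values as expit(z); sigmoid(0) is exactly 0.5
print(np.allclose(sigmoid(z), expit(z)))  # True
```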
Let's use arr values to plot sigmoid:
data = [dict(
    line=dict(color='00CED1', width=3),
    name = 'sigmoid',
    x = arr,
    y = expit(arr))]
plotly.offline.iplot(data)
Now we see why it is called a "sigmoid" function. It looks more or less like an "S".
A nice property of the sigmoid is that it compresses values into the range from 0 to 1, which is exactly the range a probability has. This is why we can use it as an actual predictor: all arguments landing on the sigmoid at 0.5 (inclusive) and above can be classified as 1, and all values landing below 0.5 (exclusive) can be classified as 0.
This is why in binary classification a sigmoid (or, for multiple classes, a softmax) function is usually added at the end of a neural network as the output layer's neuron. Every data point signal which enters the network will end up (in some modified form) at the end of the network, going through this last sigmoid, which will produce the probability of the item being 0 or 1. Again, the usual convention is that a probability < 0.5 ends up as class 0 and a sigmoid output >= 0.5 is classified as class 1: after receiving the sigmoid output we just assign probabilities like 0.6789 or 0.2355 to class 1 or 0 correspondingly.
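The 0.5 convention is a one-liner in numpy; the probability values below are made up for illustration:

```python
import numpy as np

# hypothetical sigmoid outputs for four data points
probabilities = np.array([0.2355, 0.5, 0.6789, 0.0123])

# probability >= 0.5 -> class 1 (orange), < 0.5 -> class 0 (blue)
predictions = np.where(probabilities >= 0.5, 1, 0)
print(predictions)  # [0 1 1 0]
```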
Now let's do some sigmoid tuning. Let's scale each of our arr values just before they are passed to the sigmoid and see how it affects the sigmoid's shape.
# create sigmoid with slider.
# Check these links for details:
# https://plot.ly/python/sliders/
# https://plot.ly/python/gapminder-example/
weights = np.arange(-5,5,0.1)
# also add really big weights to see how sigmoid creates a hard threshold
weights = np.insert(weights, 0, -1_000, axis=0)
weights = np.insert(weights, int((len(weights) + 1)/2), 0, axis=0)  # set middle value to pure 0 (in case it was -0.0001 or something)
weights = np.append(weights, [1_000])
data = [dict(
    visible = False,
    line=dict(color='000080', width=3),
    name = 'weight = '+str(weight),
    x = arr,
    y = expit(weight * arr)) for weight in weights]  # plot expit itself (no leading minus sign, so values stay in the 0..1 range)
data[62]['visible'] = True
steps = []
for i in range(len(data)):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
        label = "{0:.2f}".format(weights[i])
    )
    step['args'][1][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)
sliders = [dict(
    active = 62,
    currentvalue = {"prefix": "Weight: "},
    pad = {"t": 50},
    steps = steps
)]
layout = dict(sliders=sliders)
fig = dict(data=data, layout=layout)
# fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
As you can see by adjusting the slider on the plot above, the weight can significantly change the shape of our sigmoid function.
For example, the rightmost position of the slider calls the sigmoid function with a very big weight, +1000 (specifically we call expit(1000 * arr)), which makes it look like a square-like step with a hard threshold between the 0 and 1 values. It means that any negative x value we pass to the sigmoid will return 0 and any positive x value will return 1 (because we multiply x by 1000 just before the sigmoid receives it).
If we set the weight to a very small value, the argument scaled by this weight will be concentrated around 0 (on the x axis) when it is passed to the sigmoid. And since the middle region of the sigmoid is (almost) linear, the plot will look like a line which slowly moves up or down (depending on the weight's sign). For example, try setting the slider to 0.10 or -0.10 (a negative weight just flips the sigmoid horizontally).
So the larger the weight's magnitude (negative or positive), the more square-like/limiting the sigmoid's shape becomes. The smaller the magnitude, the smoother (more stretched out) the sigmoid becomes.
What happens if we use negative weights? The sigmoid flips horizontally. It is easy to understand by example:
let's say we have x = 2 and weight = -1000. When we use -1000 * 2 as the sigmoid input we get 0 at the output. When we use -1000 * -2 (notice the negative x) we get 1. With a positive weight of 1000 it is exactly the other way around.
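We can check this arithmetic directly with expit (for such large inputs the outputs are 0 and 1 up to floating-point precision):

```python
from scipy.special import expit

x = 2
# large positive weight: the sign of x alone decides the output
print(expit(1000 * x))    # ~1.0
print(expit(1000 * -x))   # ~0.0
# flipping the weight's sign flips the outputs, i.e. flips the sigmoid horizontally
print(expit(-1000 * x))   # ~0.0
print(expit(-1000 * -x))  # ~1.0
```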
Also, if we set the weight exactly to 0 (in other words, 0 * arr means we pass 0 to the sigmoid), the output will simply be a line parallel to the x axis at y = 0.5.
Interestingly, the original arr values stay the same, but thanks to the weights we can send a modified version of them to the sigmoid. This trick allows us to get different sigmoid outputs for the same x values.
So as you can see, the sigmoid provides a lot of tuning options just by using a multiplier (weight) in front of the argument before passing it to the sigmoid.
Btw, if we put a negative sign in front of the sigmoid itself, say -expit(x), the sigmoid flips vertically. For example -expit(10_000) will return -1 (expit(10_000) returns 1, and by adding - in front of it we flip it vertically).
If needed, just play around with the slider above a bit to make weight-shape relationship more intuitive.
Later we will also see how bias affects our data.
But neural networks would be useless if they couldn't classify data points.
Ok, but what does it mean to classify data points?
It means that (in our binary classification experiment) when we send a data point to our neural network model's input layer (btw, a model = how we see the world), at its output layer we receive the probability of that data point belonging to class 0 (blue label, probability < 0.5) or class 1 (orange label, probability >= 0.5 at the network's output).
And how does the neural network do it?
First of all, all the data points we get are generated by some process/function. It can be as simple as a sine or cosine function, or it can be a very complex function. Most of the time it is so complex that we cannot guess it (and after all we don't want to guess, we want to delegate as much work as possible to computers while we chill on the beach, right?).
This is where deep neural networks can help.
It turns out that neural networks are function approximators. In our experiment, what the trained neural network does is create an approximation of the true data-point-generating function such that, after the data points pass through the output layer's activation sigmoid, orange-class data points end up in the orange region of the sigmoid (>= 0.5) and blue-class data points end up in the blue region (< 0.5).
A real-life example of a complex function which a neural network is able to approximate with more or less success (let's say 80% accuracy for a good model):
There is a famous titanic data set on kaggle.com https://www.kaggle.com/c/titanic where we are provided with information about passengers (like sex, age, cabin class, ticket price and other features). We have training data with a list of passengers who survived and who didn't, and test data where it is unknown who survived and who didn't. This is a binary classification problem, just like our binary classification of blue and orange points. Our goal is to make (as correct as possible) predictions for the test data based on the training data we have.
But the data is so complex that we humans cannot work out the function which determines who survived and who didn't (in other words, what the relationship/dependency is between the target variable ("survived") and all the features (like age and sex) we have). This is a perfect case for neural networks and other machine learning techniques: we create a model which will hopefully predict correctly, for as many passengers as possible, who survived and who didn't.
So our goal in this case would be to find an approximation function f(age, sex, cabin class, children onboard, ...) which in as many cases as possible returns the correct answer survived/not survived (basically 1/0) for each given passenger.
Now the question is - why do we focus on sigmoid function so much?
This is because the sigmoid function (as well as many other non-linear functions like tanh/relu/elu etc.) is used by a neural network as its main tool to create all kinds of function approximations. How? It turns out non-linear functions are magical. For example, by combining multiple sigmoid outputs we can get totally different functions, like sine, cosine and much more complex ones (see the titanic example above: yes, sigmoids can help us figure out who survived and who didn't).
Let's see this in practice to make our understanding much clearer:
# let's plot a sigmoid out of x values scaled by some big weight (1000)
x = np.arange(-10,10,0.01)  # generate numbers from -10 to 10 with step=0.01
data = [dict(
    line=dict(color='00CED1', width=3),
    name = 'sigmoid',
    x = x,
    y = expit(1000 * x))]
plotly.offline.iplot(data)
# now let's see what happens if we introduce some bias, say -2500
x = np.arange(-10,10,0.01)  # generate numbers from -10 to 10 with step=0.01
data = [dict(
    line=dict(color='00CED1', width=3),
    name = 'sigmoid',
    x = x,
    y = expit(1000 * x - 2500))]  # <- weight = 1000, bias = -2500
plotly.offline.iplot(data)
As you can see, the bias has no effect on the shape of our function, but it shifts the function to the right or to the left of the origin. The higher the weight, the larger the bias needed to shift the function by the same amount; correspondingly, a smaller bias is enough for smaller weights. Just play with the plot above to get comfortable with bias: change the weight and the bias to see the relationship.
So using a bias allows even more flexibility. It turns out that the bias lets us easily move the data points' position on the sigmoid. Say we have blue dots on the left and orange dots on the right: we can easily place the blue dots below the 0.5 value and the orange dots into the >= 0.5 range (we will see this in practice later).
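We can verify where the shift lands with a quick calculation: the sigmoid expit(w * x + b) crosses 0.5 exactly where the pre-activation w * x + b equals 0, i.e. at x = -b / w (using weight = 1000 and bias = -2500 as an example):

```python
from scipy.special import expit

w, b = 1000, -2500
# the curve's midpoint: solve w * x + b == 0 for x
midpoint = -b / w
print(midpoint)                 # 2.5
print(expit(w * midpoint + b))  # 0.5, exactly the crossing point
```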
# Now let's see what happens when we combine 2 sigmoids together!
x = np.arange(-10,10,0.01)
data = [dict(
    line=dict(width=2),
    name = 'sigmoid',
    x = x,
    y = expit(100 * x) - expit(100 * x - 150))]  # using 2 sigmoids here to generate y (combination/difference of 2 sigmoid outputs)
plotly.offline.iplot(data)
So let's explain what exactly happened in the plot above:
First of all we generate an array of data points, which is basically x coordinates from -10 to 10 with step 0.01.
Then, just before we pass these values to the 1st sigmoid, we multiply them by the big weight = 100 to tune our sigmoid to have more square-like edges (because the sigmoid compresses big negative numbers to almost 0 and big positive numbers to almost 1).
Next we add another sigmoid with a negative sign (in other words, we subtract the 2nd sigmoid from the 1st one). From the comments above we know that we can flip a sigmoid upside down by multiplying the whole function by -1. This is why the resulting function first goes up and then goes down (we subtract the 2nd sigmoid's output from the 1st one's).
The plateau in the region from 0 to 1.5 or so is there thanks to the negative bias we use in the second sigmoid: -expit(100 * x - 150) <- weight = 100; bias = -150. From our experiments above we know that a negative bias shifts the sigmoid to the right.
In general, we apply function transformations to each sigmoid to achieve the results we need.
Let's try to make a smoother function out of 2 sigmoids:
# Creating a smoother function using the same technique (combination of 2 sigmoid outputs):
x = np.arange(-10,10,0.01)
data = [dict(
    line=dict(width=2),
    name = 'sigmoid',
    x = x,
    y = expit(2*x + 3) - expit(2*x - 3))]  # using 2 sigmoids here to generate y (combination/difference of 2 sigmoid outputs)
plotly.offline.iplot(data)
To get comfortable with creating various functions, try to play around with the plot above by changing weights, biases and sign of each sigmoid. Imagine some simple shapes and try to achieve them by tweaking the parameters.
This, by the way (finding good parameters: weights and biases), is the primary duty of neural network training, which we will see soon. Its goal is to come up with a shape/function which approximates the original data-generating function as closely as possible. And the more sigmoids (or other nonlinear functions) it can use, the more complex a function it can approximate (up to a point).
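As a small numeric check of this idea, here is the smooth "bump" built from two sigmoids again, this time inspecting its values instead of plotting:

```python
import numpy as np
from scipy.special import expit

x = np.arange(-10, 10, 0.01)
# difference of two shifted sigmoids: a smooth bump centered at x = 0
bump = expit(2 * x + 3) - expit(2 * x - 3)

print(x[np.argmax(bump)])  # the peak sits (numerically) at x ~ 0
print(bump.max())          # peak height is expit(3) - expit(-3), ~0.905
print(bump[0], bump[-1])   # both edges are close to 0
```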
Let's see how 3 sigmoids are working together by playing with weights, biases and sigmoid signs.
x = np.arange(-10,10,0.01)
sigm1 = expit(5*x + 3)
sigm2 = -expit(5*x - 3)  # notice the minus sign in front of this sigmoid
sigm3 = expit(2.5*x - 10)
data = [dict(
    line=dict(width=2),
    name = 'sigmoid',
    x = x,
    y = sigm1 + sigm2 + sigm3)]  # using 3 sigmoids here to generate y (combination/sum of 3 sigmoid outputs)
plotly.offline.iplot(data)
Now, armed with all the required knowledge, let's go back to our circle example and see how function approximator (neural network) can be used in practice!
There is a standard mechanism in neural networks called backpropagation. It is used to find appropriate (optimal if possible) parameters (weights and biases) to classify data points as accurately as possible. To keep our experiment simple we will not use backpropagation; instead we will use random search to find appropriate weights and biases.
For a case like ours it is ok to use random search instead of backpropagation, because we only have 3 neurons in the hidden layer and 1 neuron in the output layer.
Random search will not give us an optimal solution, but it will at least find parameters which classify all the training data points we have correctly, which is enough for our demo.
You may ask: do we use both random search and backpropagation in real life? The answer is that in real life we only use backpropagation. It works much faster than random search because it improves parameter values (towards better classification outcomes) in a systematic way, while random search works chaotically.
Imagine if we had 1_000_000 weights. In that case there is virtually no chance random search would succeed in any reasonable period of time. It could take years of huge compute resources to find appropriate weights and biases by trying them randomly (at least with the hardware we have today). Backpropagation, on the other hand, can improve the neural network's model reasonably quickly.
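To see what "systematic" means here, consider a deliberately tiny, hypothetical example (one parameter and a made-up loss function, not our network): gradient descent walks straight toward the minimum instead of guessing, which is the core idea backpropagation scales up to millions of weights.

```python
# toy loss: L(w) = (w - 3) ** 2, minimized at w = 3; its derivative is 2 * (w - 3)
def grad(w):
    return 2 * (w - 3)

w = -10.0            # start far away from the optimum
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)  # each step moves w systematically toward the minimum

print(w)  # ~3.0 after 100 steps, no random guessing involved
```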
Backpropagation has its limits as well, btw. As we add more hidden layers to the system, it becomes much harder for backpropagation to do its job. People are coming up with techniques like synthetic gradients etc., but that is out of scope for this discussion.
But back to our case. Let's take the example defined above and run random search until we come up with some good weights and biases for each neuron of the hidden layer and for the neuron of the output layer.
# This code creates a neural network with randomly generated parameters (weights and biases)
# again and again, until it finds such configuration/parameters that all data points are classified correctly
# Keep in mind that EACH TIME you run this script, parameters found will be different due to randomness
parameters = []
# set this variable to True if you want to use precalculated parameters this notebook was created with
use_existing_parameters = True  # False = random search parameters; True = use already found parameters
if use_existing_parameters:
    # hidden layer #1 neuron #1 bias
    b1 = 2.78966839301839
    # hidden layer #1 neuron #2 bias
    b2 = -2.9583465842663803
    # hidden layer #1 neuron #3 bias
    b3 = -1.2706716633454984
    # output layer neuron #1 bias
    b_out = -0.022314466218519513
    # hidden layer #1 neuron #1 weights (1 weight per feature)
    w1 = [0.7656994723175665, -0.880436541624743]
    # hidden layer #1 neuron #2 weights
    w2 = [0.8567863141597765, 0.35836350843155396]
    # hidden layer #1 neuron #3 weights
    w3 = [-0.8271544911925803, -1.228378092873149]
    # output layer neuron #1 weights
    w_out = [4.2596078004961075, -9.234952596030062, -4.34764412206893]
    # Now we only need to access this variable and it will give us all parameters we need for our network
    parameters = [
        b1, b2, b3, b_out, w1, w2, w3, w_out
    ]
else:
    # to find parameters reasonably quickly, we are using only 100 points
    data_points_count = 100
    x1, x2, label = getTrainingData(data_points_count=data_points_count)
    left_border = -1.5
    right_border = 1.5
    label1_cnt = np.count_nonzero(label)
    cnt = 0
    parameters = None  # we will store good parameters here
    while True:  # <- same as iterate forever (unless we stop it from inside the loop)
        # generate 6 random values in the given range and assign them to corresponding weights of the hidden layer
        rnd = np.random.uniform(left_border, right_border, 6)
        w1_i1 = rnd[0]
        w1_i2 = rnd[1]
        w2_i1 = rnd[2]
        w2_i2 = rnd[3]
        w3_i1 = rnd[4]
        w3_i2 = rnd[5]
        # generate 3 random weights and assign them to output layer weights
        rnd = np.random.uniform(-10, 10, 3)
        w_out_i1 = rnd[0]
        w_out_i2 = rnd[1]
        w_out_i3 = rnd[2]
        # generate 4 random values and assign them to corresponding biases
        rnd = np.random.uniform(-3, 3, 4)
        b1_i1 = rnd[0]
        b2_i1 = rnd[1]
        b3_i1 = rnd[2]
        b_out_i1 = rnd[3]
        if cnt % 100_000 == 0:
            print("Iteration {}".format(cnt))
        cnt += 1
        # HIDDEN LAYER:
        # neuron 1 of layer 1
        b1 = b1_i1
        w1 = [w1_i1, w1_i2]
        z1 = np.dot(w1, [x1, x2]) + b1
        a1 = expit(z1)
        # neuron 2 of layer 1
        b2 = b2_i1
        w2 = [w2_i1, w2_i2]
        z2 = np.dot(w2, [x1, x2]) + b2
        a2 = expit(z2)
        # neuron 3 of layer 1
        b3 = b3_i1
        w3 = [w3_i1, w3_i2]
        z3 = np.dot(w3, [x1, x2]) + b3
        a3 = expit(z3)
        # OUTPUT LAYER:
        # neuron of the output layer
        b_out = b_out_i1
        w_out = [w_out_i1, w_out_i2, w_out_i3]
        z_out = np.dot(w_out, [a1, a2, a3]) + b_out
        a_out = expit(z_out)
        # PREDICTIONS:
        # predictions of our network for each data point
        prediction = np.where(a_out >= 0.5, 1, 0)
        # CHECK NETWORK's "QUALITY":
        # make sure prediction vs. label difference is acceptable
        accuracy = np.sum(label == prediction) / data_points_count
        min_acceptable_accuracy = 1.0
        if accuracy >= min_acceptable_accuracy:  # if so, set parameters and exit the loop
            print("\nFOUND APPROPRIATE PARAMETERS!!!!")
            print("All of them will be assigned to a variable called parameters which we will use later throughout the code:\n")
            print("# Following parameters were generated:\n")
            print("# hidden layer #1 neuron #1 bias")
            print("b1 = {}".format(b1_i1))
            print("# hidden layer #1 neuron #2 bias")
            print("b2 = {}".format(b2_i1))
            print("# hidden layer #1 neuron #3 bias")
            print("b3 = {}".format(b3_i1))
            print("# output layer neuron #1 bias")
            print("b_out = {}".format(b_out_i1))
            print("# hidden layer #1 neuron #1 weights (1 weight per feature)")
            print("w1 = [{}, {}]".format(w1_i1, w1_i2))
            print("# hidden layer #1 neuron #2 weights")
            print("w2 = [{}, {}]".format(w2_i1, w2_i2))
            print("# hidden layer #1 neuron #3 weights")
            print("w3 = [{}, {}]".format(w3_i1, w3_i2))
            print("# output layer neuron #1 weights")
            print("w_out = [{}, {}, {}]".format(w_out_i1, w_out_i2, w_out_i3))
            # Now we only need to access this variable and it will give us all parameters we need for our network
            parameters = [
                b1_i1, b2_i1, b3_i1, b_out_i1, [w1_i1, w1_i2], [w2_i1, w2_i2], [w3_i1, w3_i2], [w_out_i1, w_out_i2, w_out_i3]
            ]
            break
What the script above does: during each cycle/iteration of the while loop we create a new neural network with random values generated in that cycle AND we check whether ALL data points are classified (orange/blue) correctly.
If, during some particular iteration, appropriate weights and biases are found (all data points are correctly classified), we simply terminate the loop and save the weights and biases generated for each neuron into the "parameters" variable. We can then use these values to initialize our neural network's parameters. This is exactly what we do next:
We now know that the sigmoid can take different shapes thanks to the sign and magnitude of the "weight" multipliers and the "bias" used to modify the argument just before it is passed to the sigmoid's input (the same holds for other nonlinear functions).
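To make this concrete, here is a tiny standalone sketch (using scipy's expit, the same sigmoid used throughout this tutorial; the specific weight and bias values are picked purely for illustration) showing how the weight controls the sigmoid's steepness and direction while the bias slides it along the axis:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid: 1 / (1 + e^-z)

x = np.linspace(-5, 5, 11)

# the weight controls steepness; its sign mirrors the curve
gentle  = expit(0.5 * x)   # slowly rising S-curve
steep   = expit(5.0 * x)   # almost a step function
flipped = expit(-1.0 * x)  # decreasing instead of increasing

# the bias slides the curve left/right along the x axis
shifted = expit(1.0 * x + 3)  # same shape, but crosses 0.5 at x = -3 instead of x = 0

print(np.round(steep, 3))
print(np.round(flipped, 3))
print(np.round(shifted, 3))
```

Comparing the printed rows shows exactly the knobs our random search is turning: each neuron's weight/bias pair picks one such sigmoid shape.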
So at each neuron of our single hidden layer the following happens:
1) the pre-activation is calculated (this is where we multiply the given inputs (features x1 and x2) by the weights, sum them up and add the bias to prepare the input for the sigmoid);
2) the activation is calculated (the output produced by the sigmoid).
The same happens in the output layer's neuron: this time it takes input from the 3 neurons of our hidden layer and uses them to calculate its pre-activation. Then the final sigmoid is applied to get the probability of a point belonging to the orange class rather than the blue one. Finally, we take these probabilities and simply convert them to 0 or 1 (thresholding at 0.5) to get predictions.
All this is implemented in the script below:
# these data points will be the same as the ones we used to find parameters via random search,
# because we use the same number of points (100)
# and getTrainingData() always uses the same random seed to generate data points
x1, x2, label = getTrainingData(data_points_count=100)
b1,b2,b3,b_out,w1,w2,w3,w_out = parameters
# neuron 1 of layer 1
z1 = np.dot(w1, [x1, x2]) + b1 # pre-activation
a1 = expit(z1) # activation
# neuron 2 of layer 1
z2 = np.dot(w2, [x1, x2]) + b2 # a linear transformation by weights w2 AND translation by bias b2
a2 = expit(z2) # application of sigmoid function
# neuron 3 of layer 1
z3 = np.dot(w3, [x1, x2]) + b3
a3 = expit(z3)
# neuron of the output layer
z_out = np.dot(w_out, [a1, a2, a3]) + b_out
a_out = expit(z_out)
# predictions of our network for each data point
prediction = np.where(a_out >= 0.5, 1, 0)
# we will use this "original" sum of x1 and x2 to see how data is changing as it travels through different neurons
x1x2_sum = x1 + x2
# creating pandas dataframe so seaborn library could use it to create plots
df = pd.DataFrame({'x1': x1, 'x2': x2})
# we will use sum of x1 and x2 as reference to see how data is modified as it travels through the network
df['x1x2_sum'] = x1x2_sum
df['tmp'] = 1 # we only need this to make our plots look better (not related to the network data)
df['z1'] = z1
df['a1'] = a1
df['z2'] = z2
df['a2'] = a2
df['z3'] = z3
df['a3'] = a3
df['z_out'] = z_out
df['a_out'] = expit(z_out)
df['prediction'] = prediction
df['label'] = label
display(df.head(3))
# This is what happens as our data point goes through hidden layer's neuron #1
sns.lmplot(x="x1", y="x2", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("original circle data: x1 and x2")
# Here is how our plot looks in 2D. To make 2D possible we use a little trick.
# Instead of plotting x1 and x2 separately, which would require 2 axes,
# we sum them up so we only need 1 axis - x.
# We use the other axis (y) to plot z1, giving us a 2D plot.
sns.lmplot(x="x1x2_sum", y="z1", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("linear transformation (weights/multipliers) and translation (bias) of data points")
sns.lmplot(x="z1", y="a1", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("[hidden layer #1, neuron #1] preactivation z1 is compared to activation a1")
What happens above: we pass our input data to the 1st neuron of the hidden layer, where the pre-activation value z1 is calculated.
As the second plot above (x1x2_sum -> z1) shows, the data points are linearly transformed thanks to the w1 weight vector (notice how the circle looks now: elongated and oriented in a different direction) and also translated a bit by the bias. Note that the bias doesn't change the shape of the circle BUT it moves all data points along the z1 axis. Ultimately, this shifts the data points along the sigmoid function (for example, a very large bias would force the data points to concentrate on the rightmost side of the sigmoid, and in that extreme case the sigmoid output would simply look like a flat line at a1 = 1; hopefully you get the idea).
After pre-activation, the result stored in z1 is passed to the activation part of the neuron, and that is what the last plot above shows: the data points are spread along the sigmoid in this particular way/order.
You may ask: why do the points "fall" onto this neuron's sigmoid in this particular way? The answer is simple. It doesn't matter what an individual neuron produces AS LONG AS, when it works together with the other neurons' outputs (in our specific experiment, read: the other sigmoids), the end of the network gives us the output we want, that is, the data points are classified correctly (orange points "falling" on the >= 0.5 side of the final sigmoid and blue ones on the left, < 0.5 side).
That said, let's still look at all the neurons involved in the calculations.
# This is what happens as our data point goes through hidden layer's neuron #2
sns.lmplot(x="x1", y="x2", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("original circle data: x1 and x2")
sns.lmplot(x="x1x2_sum", y="z2", data=df, hue="prediction", fit_reg=False, size=6)
ax = plt.gca()
ax.set_title("[hidden layer #1, neuron #2] preactivation z2 is compared to x1x2_sum")
sns.lmplot(x="z2", y="a2", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("[hidden layer #1, neuron #2] preactivation z2 is compared to activation a2")
# This is what happens as our data point goes through hidden layer's neuron #3
sns.lmplot(x="x1", y="x2", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("original circle data: x1 and x2")
sns.lmplot(x="x1x2_sum", y="z3", data=df, hue="prediction", fit_reg=False, size=6)
ax = plt.gca()
ax.set_title("[hidden layer #1, neuron #3] preactivation z3 is compared to x1x2_sum")
sns.lmplot(x="z3", y="a3", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("[hidden layer #1, neuron #3] preactivation z3 is compared to activation a3")
So the sigmoid outputs of hidden-layer neurons 1, 2 and 3 are stored in the variables a1, a2 and a3 respectively. In other words, each neuron of the hidden layer is independent and produces a sigmoid output with some specific shape.
Now we can use these neurons to produce a combination of sigmoids, which is the neural network's superpower for creating complex function approximations! By the way, this is exactly what the output neuron's pre-activation block will do.
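As a quick aside, here is a minimal 1D sketch of that superpower, with hand-picked weights and a bias (purely illustrative values, not the ones found by our random search): a weighted sum of two steep sigmoids, squashed by a final sigmoid, carves out a "bump" that scores the middle region high and the outside low - the same trick our network uses in 2D to isolate the circle:

```python
import numpy as np
from scipy.special import expit

x = np.linspace(-10, 10, 201)

# two steep sigmoids turning on at different places
a1 = expit(5 * (x + 3))   # switches on around x = -3
a2 = expit(5 * (x - 3))   # switches on around x = +3

# hand-picked "output neuron" weights and bias combine them into a bump
bump = 8 * a1 - 8 * a2 - 4   # high only for roughly -3 < x < 3
out = expit(bump)            # squashed back into the (0, 1) range

# the middle region scores near 1, the outer regions near 0
print(round(out[100], 3), round(out[0], 3), round(out[-1], 3))  # center, far left, far right
```

A 2D version of exactly this construction (three sigmoids instead of two) is what separates the inside of our orange circle from the blue surroundings.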
This is why we send all 3 neuron outputs of the hidden layer to another neuron located in the output layer. In our experiment it is the final neuron, because it turns out that a network consisting of an input layer, a hidden layer with 3 neurons and an output layer with 1 neuron is enough to produce exactly what we need: a function which separates the orange points from the blue ones in data that looks like a circle of orange points surrounded by blue points.
So let's see what happens next as we feed the output neuron's pre-activation block with the outputs of the hidden layer (scaling those outputs, of course, by the output neuron's weights and adding a bias tuned specifically for the output neuron).
# This is what happens as each data point goes through output layer's neuron
sns.lmplot(x="x1x2_sum", y="z_out", hue="prediction", data=df, fit_reg=False, size=7)
sns.lmplot(x="x1x2_sum", y="a_out", hue="prediction", data=df, fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("compressing z_out values into range from 0 to 1 to turn it into probabilities")
sns.lmplot(x="z_out", y="a_out", data=df, hue="prediction", fit_reg=False, size=7)
ax = plt.gca()
ax.set_title("point of destination for our data points! This is where actual classification happens!")
See the magic above? The orange points moved up while the blue ones went down. So after the pre-activation in the output layer (which actually did the job of separating the data points), we only needed to pass the pre-activation value on to the neuron's activation block to make the data points "fall" onto the final sigmoid, where we can easily classify all points in the >= 0.5 region as orange and all points in the < 0.5 region as blue.
There was no magic, though. We just used random search, generating parameters again and again and again until we found such weights and biases for each neuron that our data points were classified correctly. That's it. There is no deep underlying mystery behind it. It just turns out that with such parameters the input signal undergoes transformations that let the resulting neuron clearly separate our data. Yes, we cheated, because real deep neural networks use backpropagation. But for our small experiment random search did the job and made the whole process clearer and easier for beginners.
Well, I'll try, but backpropagation is more complex and technical, so in the context of our experiment it is not possible to explain all the details. Still, let's scratch the surface, as they say:
Backpropagation is a mechanism consisting of several "ingredients". The first of them is an error measurement function. Think of it as an "error detector" we attach to the end of a neural network. It can measure errors in different ways, but a popular method is to calculate the so-called cross-entropy loss of the network.
Basically, what cross-entropy does is take each data point, look at the probability our network generated for it (stored in a_out in our experiment) and compare it with the numeric value of the true label (when we train a neural network, we are provided with labels (read: probability = 0 or probability = 1) for each and every data point, so we know the true class of all data points).
Let's say that for some data point our network's prediction is 0.2375 (so it thinks the point is much more likely blue than orange), but the true label is 1 (the training data says it IS orange).
Clearly this particular data point is misclassified. We make such a comparison for each data point and accumulate the error along the way: the bigger the mismatch, the higher the cross-entropy loss. In some sense the error amount is the level of surprise in our predictions compared to the true labels. We want to reduce that error/level of surprise, which also means we want our network to make better predictions.
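In code, the verbal description above fits in a few lines. This is a generic binary cross-entropy sketch (a standard formulation, not code taken from our experiment):

```python
import numpy as np

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Average cross-entropy loss: confident wrong predictions are penalized heavily."""
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

labels = np.array([1, 0, 1])
good   = np.array([0.95, 0.05, 0.90])  # confident and correct -> small loss
bad    = np.array([0.2375, 0.90, 0.10])  # confident and wrong -> large loss

print(binary_cross_entropy(labels, good))
print(binary_cross_entropy(labels, bad))
```

Note how the 0.2375-for-a-true-orange-point example from the text lands in the "bad" row: the loss function is the "error detector" that tells us how surprised we should be.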
This is where mathematical tricks kick in! Cross-entropy is a mathematical trick in our case because, in some sense, it doesn't belong to the neural network itself. It is only a sensor/detector of error. Let's see how it is used.
It turns out a neural network is just one big function consisting of a set of nested functions, like this: f(x) = f3(f2(f1(x))).
In our case it is a_out = sigmoid(z_out(a1(z1(x1,x2)), a2(z2(x1,x2)), a3(z3(x1,x2)))), where, for example, z1 = w1[0] * x1 + w1[1] * x2 + b1.
So, as you can see, it is one connected chain.
But let's go even further and make it more explicit to fully understand how our particular neural network is built:
First of all let's introduce all the involved parts:
in general:
z1 = f1_1(x1, x2)
z2 = f1_2(x1, x2)
z3 = f1_3(x1, x2)
a1 = sigmoid(z1)
a2 = sigmoid(z2)
a3 = sigmoid(z3)
z_out = f2_1(a1, a2, a3)
a_out = sigmoid(z_out)
more explicit:
z1 = w1[0] * x1 + w1[1] * x2 + b1
z2 = w2[0] * x1 + w2[1] * x2 + b2
z3 = w3[0] * x1 + w3[1] * x2 + b3
a1 = sigmoid(z1)
a2 = sigmoid(z2)
a3 = sigmoid(z3)
z_out = w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out
a_out = sigmoid(z_out)
more explicit (using actual expressions in place of z1, z2 and z3):
a1 = sigmoid(w1[0] * x1 + w1[1] * x2 + b1)
a2 = sigmoid(w2[0] * x1 + w2[1] * x2 + b2)
a3 = sigmoid(w3[0] * x1 + w3[1] * x2 + b3)
z_out = w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out
a_out = sigmoid(z_out)
more explicit (using actual expressions in place of a1, a2 and a3):
z_out = w_out[0] * (sigmoid(w1[0] * x1 + w1[1] * x2 + b1)) + w_out[1] * (sigmoid(w2[0] * x1 + w2[1] * x2 + b2)) + w_out[2] * (sigmoid(w3[0] * x1 + w3[1] * x2 + b3)) + b_out
a_out = sigmoid(z_out)
Notice how the output layer's neuron applies weights to the sigmoids coming out of the hidden layer and adds them up (along with the bias b_out) to produce a complex function.
more explicit (using actual expression in place of z_out):
a_out = sigmoid(w_out[0] * (sigmoid(w1[0] * x1 + w1[1] * x2 + b1)) + w_out[1] * (sigmoid(w2[0] * x1 + w2[1] * x2 + b2)) + w_out[2] * (sigmoid(w3[0] * x1 + w3[1] * x2 + b3)) + b_out)
and the most explicit expression replaces "sigmoid" with 1/(1 + e^-z), where z is the argument passed to the sigmoid function:
a_out = 1/(1 + e^-(w_out[0] * (1/(1 + e^-(w1[0] * x1 + w1[1] * x2 + b1))) + w_out[1] * (1/(1 + e^-(w2[0] * x1 + w2[1] * x2 + b2))) + w_out[2] * (1/(1 + e^-(w3[0] * x1 + w3[1] * x2 + b3))) + b_out))
In general, the input layer is wrapped by the hidden layer, which is wrapped by the output layer. This is why we get this long equation.
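You can verify numerically that the long expanded expression and the layer-by-layer computation are literally the same function. Here is a quick check with made-up parameters (any values would do; these are not the ones found by random search):

```python
import numpy as np
from scipy.special import expit

# arbitrary parameters and input, just for the equivalence check
w1, w2, w3 = [0.3, -1.2], [2.0, 0.5], [-0.7, 0.9]
w_out = [1.5, -2.0, 0.8]
b1, b2, b3, b_out = 0.1, -0.4, 0.25, -1.0
x1, x2 = 0.6, -1.3

# layer-by-layer, as in the scripts above
a1 = expit(w1[0] * x1 + w1[1] * x2 + b1)
a2 = expit(w2[0] * x1 + w2[1] * x2 + b2)
a3 = expit(w3[0] * x1 + w3[1] * x2 + b3)
layered = expit(w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out)

# the single fully expanded expression from the text
e = np.e
expanded = 1 / (1 + e ** -(
    w_out[0] * (1 / (1 + e ** -(w1[0] * x1 + w1[1] * x2 + b1)))
    + w_out[1] * (1 / (1 + e ** -(w2[0] * x1 + w2[1] * x2 + b2)))
    + w_out[2] * (1 / (1 + e ** -(w3[0] * x1 + w3[1] * x2 + b3)))
    + b_out))

print(layered, expanded)  # identical up to floating-point noise
```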
AND FINALLY, to get the luxury of calculus, we introduce cross-entropy like this: cross_entropy = f(label, a_out), so it continues that long chain of wrapping one function into another.
So what, you may ask? It turns out the cross-entropy function is differentiable (as are the sigmoid functions we use), and what this means in our particular context is that we can add another ingredient to the mix: the so-called Chain Rule.
In our chain of functions, the chain rule allows us to "see" how, say, the x1 weight of neuron 1 of our hidden layer affects the cross-entropy loss. Or how the bias of the output layer affects it, or how the x2 weight of neuron 3 of our hidden layer does. Hopefully you get the idea.
The chain rule is magical, by the way. It allows us to "see" (to measure, actually) how wiggling any parameter (any weight or any bias of the network) affects the "error detector's" value. In other words, if we know how the parameters affect the network's error, we can tune them in such a way that the error goes down. In the context of backpropagation this is called optimization, or training, of a neural network, and it is how modern deep learning networks are trained: we systematically reduce the network's error by modifying its parameters, thanks to calculus, using so-called gradient descent (another ingredient of backpropagation).
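To give a feel for gradient descent without deriving the chain rule, here is a minimal sketch that trains a single sigmoid neuron on four 1D points. It estimates the gradients with finite differences instead of the analytic chain rule (backpropagation computes the exact same quantities, just far more efficiently); the data, learning rate and step count are made up for the demo:

```python
import numpy as np
from scipy.special import expit

# four 1D points, separable by the sign of x
x = np.array([-2.0, -1.0, 1.0, 2.0])
label = np.array([0, 0, 1, 1])

def loss(w, b):
    """Cross-entropy loss of a single sigmoid neuron with weight w and bias b."""
    p = np.clip(expit(w * x + b), 1e-12, 1 - 1e-12)
    return -np.mean(label * np.log(p) + (1 - label) * np.log(1 - p))

w, b, lr, h = 0.0, 0.0, 1.0, 1e-6
start = loss(w, b)
for _ in range(200):
    dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)  # numerical d(loss)/dw
    db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)  # numerical d(loss)/db
    w -= lr * dw  # step against the gradient: error goes down
    b -= lr * db

print(start, loss(w, b))  # the loss shrinks as training proceeds
```

This is the whole idea of training in miniature: measure how each parameter affects the error detector, then nudge every parameter downhill, over and over.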
These are the queries you can make to google:
Note: you need to know some calculus to understand this, but there are a lot of videos on YouTube for that, and there is also an excellent Coursera course that I took myself and highly recommend. Here is the link: https://www.coursera.org/learn/calculus1 Taking this course will be enough to understand differentiation, derivatives and the other calculus used in deep learning, including gradients (which are not covered in the course, but you will easily pick up the concept elsewhere after taking it).
Interestingly, backpropagation is an "attachment" to the neural network. It is not part of the network itself, but it forces the network to move in a certain direction to fulfill backpropagation's "desire" to reduce the error.
So, over many years of research, people found that neural networks are able to 1) create complex function approximations in such a way that they are very useful for making predictions on yet unseen data points, and 2) improve in the direction we want (by "attaching" backpropagation).
As you can see, backpropagation deserves a separate discussion. The good news is that if you followed along with our experiment, understanding how backpropagation works will be much easier, because we have discussed everything that happens right before backpropagation kicks in.
OK, after this backpropagation detour, let's get back to our example. What we saw above is good, but we need more insight. Looking back at the x1x2_sum -> a_out plot, we want to know exactly why the orange points concentrate the way they do, above the blue ones. How is it possible? We already know how: by combining the hidden layer's sigmoids into a function that "works" that way.
But we want some more insight, right? Yes. For that, let's introduce another dimension: the 3rd one. Up to this point we only used 2 dimensions (x axis and y axis), comparing the SUM of x1 and x2 (a single value, so a single point on the x axis) against other single values, such as a_out (hence 2 dimensions).
Now, instead of summing x1 and x2, let's plot x1 on the x axis, x2 on the y axis, and other values, such as z1, z_out or a_out, on the z axis (so xyz = 3 dimensions). This way we will be able to see the whole picture of what happens as a data point (feature x1 as the x coordinate and feature x2 as the y coordinate) travels through the various stages of our neural network, moving towards its output layer's neuron.
# creating neural network from scratch - exactly the same architecture as we used previously
x1, x2, label = getTrainingData(data_points_count=100)
b1,b2,b3,b_out,w1,w2,w3,w_out = parameters
# neuron 1 of layer 1
z1 = np.dot(w1, [x1, x2]) + b1 # pre-activation
a1 = expit(z1) # activation
# neuron 2 of layer 1
z2 = np.dot(w2, [x1, x2]) + b2 # a linear transformation by weights w2 AND translation by bias b2
a2 = expit(z2) # application of sigmoid function
# neuron 3 of layer 1
z3 = np.dot(w3, [x1, x2]) + b3
a3 = expit(z3)
# neuron of the output layer
z_out = np.dot(w_out, [a1, a2, a3]) + b_out
a_out = expit(z_out)
# predictions of our network for each data point
prediction = np.where(a_out >= 0.5, 1, 0)
x1x2_sum = x1 + x2
# x1x2 original data
layout = go.Layout(
title='x1x2 2D view (original input)',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. z1
layout = go.Layout(
title='x1x2 vs. z1',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='z1')
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z1,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace1]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. a1
layout = go.Layout(
title='x1x2 vs. a1',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='a1')
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a1,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# combined comparison plot
layout = go.Layout(
title='combined comparison plot',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
name='x1x2 original data points (input)',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z1,
name='x1x2 vs. z1',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a1,
name='x1x2 vs. a1',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0, trace1, trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 original data
layout = go.Layout(
title='x1x2 2D view (original input)',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. z2
layout = go.Layout(
title='x1x2 vs. z2',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='z2')
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z2,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace1]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. a2
layout = go.Layout(
title='x1x2 vs. a2',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='a2')
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a2,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# combined comparison plot
layout = go.Layout(
title='combined comparison plot',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
name='x1x2 original data points (input)',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z2,
name='x1x2 vs. z2',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a2,
name='x1x2 vs. a2',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0, trace1, trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 original data
layout = go.Layout(
title='x1x2 2D view (original input)',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. z3
layout = go.Layout(
title='x1x2 vs. z3',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='z3')
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z3,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace1]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# x1x2 vs. a3
layout = go.Layout(
title='x1x2 vs. a3',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2'),
zaxis = dict(title='a3')
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a3,
mode='markers',
marker=dict(
color=label,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# combined comparison plot
layout = go.Layout(
title='combined comparison plot',
scene = dict(
xaxis = dict(title='x1'),
yaxis = dict(title='x2')
)
)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=np.zeros(x1.shape[0]),
name='x1x2 original data points (input)',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace1 = go.Scatter3d(
x=x1,
y=x2,
z=z3,
name='x1x2 vs. z3',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
trace2 = go.Scatter3d(
x=x1,
y=x2,
z=a3,
name='x1x2 vs. a3',
mode='markers',
marker=dict(
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0, trace1, trace2]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
trace0 = go.Scatter3d(
x=x1x2_sum,
y=np.zeros(x1x2_sum.shape[0]),
z=a_out,
mode='markers',
marker=dict(
color=prediction,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0]
fig = go.Figure(data=data)
plotly.offline.iplot(fig)
trace0 = go.Scatter3d(
x=x1,
y=x2,
z=a_out,
mode='markers',
marker=dict(
color=prediction,
size=6,
line=dict(
width=0.1
),
opacity=1.0
)
)
data = [trace0]
fig = go.Figure(data=data)
plotly.offline.iplot(fig)
b1,b2,b3,b_out,w1,w2,w3,w_out = parameters
x = np.arange(-10,10,0.20)
y = np.arange(-10,10,0.20)
# to plot a 3D surface with the plotly library we need to convert our data into xv and yv matrices
# check out this helpful link if you want to understand how meshgrid works:
# https://stackoverflow.com/questions/36013063/what-is-purpose-of-meshgrid-in-python
xv, yv = np.meshgrid(x, y)
label = np.zeros((xv.shape[0], xv.shape[1]))
for i in range(len(x)):
    for j in range(len(y)):
        # label each grid point xv[i,j], yv[i,j]
        label[i,j] = getCircleLabel(Point(x=xv[i,j], y=yv[i,j]), Point(x=0, y=0), radius=5)
# CREATING THE SAME NEURAL NETWORK STRUCTURE AS WE USE ABOVE
# neuron 1 of layer 1
z1 = w1[0] * xv + w1[1] * yv + b1
a1 = expit(z1)
# neuron 2 of layer 1
z2 = w2[0] * xv + w2[1] * yv + b2
a2 = expit(z2)
# neuron 3 of layer 1
z3 = w3[0] * xv + w3[1] * yv + b3
a3 = expit(z3)
# neuron of the output layer for 3 sigmoids at hidden 1
z_out = w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out
a_out = expit(z_out)
# predictions of our network for each data point
prediction = np.where(a_out >= 0.5, 1, 0)
# code needed for the plotly library to create a 3D surface plot
layout = go.Layout(
width=800,
height=800,
autosize=False,
title='plot'
)
layout['scene'] = {
'xaxis': dict(title='x1'),
'yaxis': dict(title='x2')
}
# plot z1, z2 and z3 of the hidden layer (layer #1 in our network)
layout['title'] = '[hidden layer\'s neuron #1] z1'
layout['scene']['zaxis'] = dict(title='z1')
data = [
{"x": xv, 'y': yv, 'z': z1, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[hidden layer\'s neuron #2] z2'
layout['scene']['zaxis'] = dict(title='z2')
data = [
{"x": xv, 'y': yv, 'z': z2, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[hidden layer\'s neuron #3] z3'
layout['scene']['zaxis'] = dict(title='z3')
data = [
{"x": xv, 'y': yv, 'z': z3, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# plot a1, a2 and a3 (result of passing z1, z2 and z3 correspondingly through activation functions)
layout['title'] = '[hidden layer\'s neuron #1] a1'
layout['scene']['zaxis'] = dict(title='a1')
data = [
{"x": xv, 'y': yv, 'z': a1, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[hidden layer\'s neuron #2] a2'
layout['scene']['zaxis'] = dict(title='a2')
data = [
{"x": xv, 'y': yv, 'z': a2, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[hidden layer\'s neuron #3] a3'
layout['scene']['zaxis'] = dict(title='a3')
data = [
{"x": xv, 'y': yv, 'z': a3, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['scene']['zaxis'] = dict(title='z')
layout['title'] = 'comparison of a1 vs. a2 vs. a3 (surfacecolor=z axis)'
data = [
{"x": xv, 'y': yv, 'z': a1, 'type': 'surface', 'showscale': False},
{"x": xv, 'y': yv, 'z': a2, 'type': 'surface', 'showscale': False},
{"x": xv, 'y': yv, 'z': a3, 'type': 'surface', 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = 'comparison of a1 vs. a2 vs. a3 (surfacecolor=label)'
data = [
{"x": xv, 'y': yv, 'z': a1, 'type': 'surface', 'surfacecolor': label, 'showscale': False},
{"x": xv, 'y': yv, 'z': a2, 'type': 'surface', 'surfacecolor': label, 'showscale': False},
{"x": xv, 'y': yv, 'z': a3, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = 'a1 vs. z1 (data compression demo)'
data = [
{"x": xv, 'y': yv, 'z': z1, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> z1', 'showscale': False},
{"x": xv, 'y': yv, 'z': a1, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> a1', 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = 'a2 vs. z2 (data compression demo)'
data = [
{"x": xv, 'y': yv, 'z': z2, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> z2', 'showscale': False},
{"x": xv, 'y': yv, 'z': a2, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> a2', 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = 'a3 vs. z3 (data compression demo)'
data = [
{"x": xv, 'y': yv, 'z': z3, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> z3', 'showscale': False},
{"x": xv, 'y': yv, 'z': a3, 'type': 'surface', 'surfacecolor': label, 'name': 'x_y -> a3', 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# plot z_out of the output layer
layout['title'] = '[output layer\'s pre-activation] z_out'
layout['scene']['zaxis'] = dict(title='z_out')
data = [
{"x": xv, 'y': yv, 'z': z_out, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[output layer\'s pre-activation] z_out'
layout['scene']['zaxis'] = dict(title='z_out')
data = [
{"x": xv, 'y': yv, 'z': z_out, 'type': 'surface'}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# plot a_out of the output layer
layout['title'] = '[output layer\'s activation] a_out'
layout['scene']['zaxis'] = dict(title='a_out')
data = [
{"x": xv, 'y': yv, 'z': a_out, 'type': 'surface', 'surfacecolor': label, 'showscale': False}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
layout['title'] = '[output layer\'s activation] a_out'
layout['scene']['zaxis'] = dict(title='a_out')
data = [
{"x": xv, 'y': yv, 'z': a_out, 'type': 'surface'}
]
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
You may think now.. hmm, but this 3D plot is unexpected. When i saw those 2D sigmoid combinations there were just curvy lines in 2D and it was pretty easy to understand, but now we have a 3 dimensional plot. Why is that and how it works?
This is a pretty expected question. It may be hard to connect 2D and 3D plots. But let's think about it. It is pretty simple. When we plot along only x and y axis (read when we have only 1 feature), we get 2D plot. When we plot along x,y (so 2 features now: x and y) and z axis (shows end result, the transformation), we have 3D plot.
Just take 1 point (which consists of 2 features: x1 (x coordinate) and x2 (y coordinate)) and feed it to the network. Then take x1 (x axis), x2 (y axis) and the resulting output value of the neural network, which is a_out (z axis) and plot it. Do this for many many points. Actually create a grid of points (read pairs of x1 and x2 features). What you will see at the end is exactly what you see above - the 3D plot we have.
By the way, here is how the output of the resulting sigmoid (of the output layer's neuron) looks compared to the x1 and x2 sum:
def draw_slices(x_step=0.5, y_step=0.5):
    b1, b2, b3, b_out, w1, w2, w3, w_out = parameters
    x = np.arange(-80, 80, x_step)
    label = []
    data = []
    yy = np.arange(-10, 10, y_step)
    for j in yy:
        y = np.zeros(len(x))
        for i in range(0, len(x)):
            y[i] = j
            label.append(getCircleLabel(Point(x=x[i], y=y[i]), Point(x=0, y=0), radius=5))
        x1x2_sum = x + y
        # neuron 1 of layer 1
        z1 = w1[0] * x + w1[1] * y + b1
        a1 = expit(z1)
        # neuron 2 of layer 1
        z2 = w2[0] * x + w2[1] * y + b2
        a2 = expit(z2)
        # neuron 3 of layer 1
        z3 = w3[0] * x + w3[1] * y + b3
        a3 = expit(z3)
        # neuron of the output layer for 3 sigmoids at hidden layer 1
        z_out = w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out
        a_out = expit(z_out)
        # predictions of our network for each data point
        prediction = np.where(a_out >= 0.5, 1, 0)
        data.append(dict(
            line=dict(width=2),
            name='sigmoid (x2={})'.format(j),
            x=x1x2_sum,
            y=a_out))
    layout = go.Layout(
        title='x1 and x2 sum vs. output of the network (a_out)',
        xaxis=dict(
            title='x1x2_sum',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        ),
        yaxis=dict(
            title='sigm_out (a_out)',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        )
    )
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig)
draw_slices(x_step=0.1, y_step=10_000)
draw_slices(x_step=0.1, y_step=5)
draw_slices(x_step=0.1, y_step=2)
draw_slices(x_step=0.05, y_step=0.1)
Again, here we combine x1 and x2 by summing them (just for visual representation purposes), but in the 3D case, since we know the individual components of the x1x2 sum, we can just take the x1, x2 and a_out values and plot them to end up with a 3D surface.
By the way, our neural network has no idea it is doing this (creating a 3D surface). It just reduces the error (which also means it improves prediction accuracy) for each data point during training, and as a side effect the 3D surface is created, which we can then use to also classify yet unseen data points.
To make this concept crystal clear, here is how we visualize the 3D surface: by plotting many points, one by one. See the plots below:
def build_3d_surface(y_list):
    x1 = []
    x2 = []
    label = []
    # initial version
    for j in np.arange(-50, 50, 0.13):
        for y_fixed in y_list:  # locking y-coordinate to a specific value for the current range of x values
            x1.append(j)
            x2.append(y_fixed)
            label.append(getCircleLabel(Point(x=j, y=y_fixed), Point(x=0, y=0), radius=5))
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'label': label})
    b1, b2, b3, b_out, w1, w2, w3, w_out = parameters
    x1x2_sum = np.array(x1) + np.array(x2)
    # neuron 1 of layer 1
    z1 = np.dot(w1, [x1, x2]) + b1
    a1 = expit(z1)
    # neuron 2 of layer 1
    z2 = np.dot(w2, [x1, x2]) + b2
    a2 = expit(z2)
    # neuron 3 of layer 1
    z3 = np.dot(w3, [x1, x2]) + b3
    a3 = expit(z3)
    # neuron of the output layer
    z_out = np.dot(w_out, [a1, a2, a3]) + b_out
    a_out = expit(z_out)
    # predictions of our network for each data point
    prediction = np.where(a_out >= 0.5, 1, 0)
    df['x1x2_sum'] = x1x2_sum
    df['tmp'] = 1
    df['z1'] = z1
    df['a1'] = a1
    df['z2'] = z2
    df['a2'] = a2
    df['z3'] = z3
    df['a3'] = a3
    df['z_out'] = z_out
    df['a_out'] = a_out
    df['prediction'] = prediction
    df['label'] = label
    trace0 = go.Scatter3d(
        x=x1,
        y=x2,
        z=a_out,
        mode='markers',
        marker=dict(
            color=prediction,
            size=6,
            line=dict(
                width=0.1
            ),
            opacity=1.0
        )
    )
    layout = go.Layout(
        width=1920,
        height=1080,
        autosize=False,
        title='plot'
    )
    data = [trace0]
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig)
How beautiful is that?
Let's add some more slices:
build_3d_surface([0, -1, -10])
build_3d_surface([10 , 5, 0, -1, -5, -10])
Another way to think about the 3D surface is that it is a set of "slices" (look at the plot above - it consists of 6 "slices"). What is a slice? Each slice is a set of points where each point's y-coordinate is a fixed/locked/frozen value and the x-coordinate is different for each point in the slice. Also note that each slice is a combination of sigmoids with a specific shape (which is partially defined by the y-coordinate - this is why each slice has a different shape, even though the x coordinates are the same for each slice). Another interesting observation is that random search "tunes" the slices in such a way that orange points end up above the blue ones.
Let's say we freeze the y-coordinate at -1 and take a range of x values from -10 to 10 with step = 0.1. This will be 1 slice of points. Next, we can change y to -2 and repeat the process. This will give us another slice. So the closer the y values are to each other, the more well-defined (dense) our 3D surface becomes.
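The slice construction just described can be sketched stand-alone like this (the `slice_points` helper name is made up for illustration; it is not part of the notebook's code):

```python
import numpy as np

def slice_points(y_fixed, x_start=-10.0, x_stop=10.0, x_step=0.1):
    """Build one 'slice': a range of x values with the y-coordinate frozen."""
    x = np.arange(x_start, x_stop, x_step)
    y = np.full(len(x), y_fixed)  # every point in the slice shares this y
    return x, y

# two neighbouring slices; stacking many of these densifies the 3D surface
x_a, y_a = slice_points(-1)
x_b, y_b = slice_points(-2)
print(len(x_a), set(y_a), set(y_b))
```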
Let's add more slices!
build_3d_surface(np.arange(-50,50,5))
Looks like an art to me!
Finally, let's add even more points (or sets of slices):
build_3d_surface(np.arange(-50,50,0.16))
So again, another way of thinking about how the 3D surface is built (as opposed to just plotting each point independently) is this: we create a (let's call it) "slice" of points where each point has the same y-coordinate (in other words, we freeze the y-coordinate's value) but a different x-coordinate. So we take some range of x-coordinates with a fixed y-coordinate and plot that slice. Then we take another y value with a range of x values and add another slice, until the 3D surface looks well defined.
* btw, we don't try to find the actual equation which our neural network generates; instead we evaluate a_out for each point and plot many points to get a nice 3D shape. But that's fine, because in most cases we don't need to know the actual formula of the function approximation created by a neural network.
While we are discussing 3D surfaces, let's also explain the intuition behind the concepts of overfitting and underfitting, which you will encounter often while studying deep learning. They are actually pretty easy to explain based on our 3D plots above.
The idea is simple. If our surface is very smooth, kind of weakly defined, we have underfitting. Just imagine that crown (the peak in the middle of our 3D plot) being almost flat. Then it would be very hard to distinguish between orange points and blue points. We can think of it as undertraining, where the peak hasn't "grown" yet.
Now imagine another situation, where each orange point creates a separate orange "island" instead of there being one reasonably sized island where many points can "rest". In this case there would be no place for new/unseen points to land, and they would fall into the "ocean" of blue points. This is called overfitting: the function created by the neural network is so precise and concrete that there is no "space" left for generalization.
Hopefully you got the idea.
Now let's move on to the next segment of our investigation and see how we can create a decision boundary which separates our points, the blue ones from the orange ones.
# Visualizing decision boundaries
x1 = []
x2 = []
label = []
# sparse plotting
# for j in np.arange(-50,50,0.1):
# x1.append(j)
# x2.append(0)
# label.append(getCircleLabel(Point(x=j, y=0), Point(x=0, y=0), radius=5))
# # x1.append(j)
# # x2.append(1)
# # label.append(getCircleLabel(Point(x=j, y=1), Point(x=0, y=0), radius=5))
# x1.append(j)
# x2.append(-10)
# label.append(getCircleLabel(Point(x=j, y=-10), Point(x=0, y=0), radius=5))
# gridlike plotting
for j in np.arange(-10, 10, 0.1):
    for k in np.arange(-10, 10, 0.1):
        x1.append(j)
        x2.append(k)
        label.append(getCircleLabel(Point(x=j, y=k), Point(x=0, y=0), radius=5))
df = pd.DataFrame({'x1': x1, 'x2': x2, 'label': label})
b1,b2,b3,b_out,w1,w2,w3,w_out = parameters
x1x2_sum = np.array(x1) + np.array(x2)
# neuron 1 of layer 1
z1 = np.dot(w1, [x1, x2]) + b1
a1 = expit(z1)
# neuron 2 of layer 1
z2 = np.dot(w2, [x1, x2]) + b2
a2 = expit(z2)
# neuron 3 of layer 1
z3 = np.dot(w3, [x1, x2]) + b3
a3 = expit(z3)
# neuron of the output layer
z_out = np.dot(w_out, [a1, a2, a3]) + b_out
a_out = expit(z_out)
# predictions of our network for each data point
prediction = np.where(a_out >= 0.5, 1, 0)
df['x1x2_sum'] = x1x2_sum
df['tmp'] = 1
df['z1'] = z1
df['a1'] = a1
df['z2'] = z2
df['a2'] = a2
df['z3'] = z3
df['a3'] = a3
df['z_out'] = z_out
df['a_out'] = a_out
df['prediction'] = prediction
df['label'] = label
df = df.sort_values(by=['z1'])
sns.lmplot(x="x1", y="x2", data=df, hue="label", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1", y="x2", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="z1", y="a1", data=df, hue="label", fit_reg=False, size=4, aspect=2)
fig, axes = plt.subplots()
axes.plot(df["z1"], df["a1"])
axes.set_xlabel('z1')
axes.set_ylabel('a1')
df = df.sort_values(by=['z2'])
sns.lmplot(x="z2", y="a2", data=df, hue="label", fit_reg=False, size=4, aspect=2)
fig, axes = plt.subplots()
axes.plot(df["z2"], df["a2"])
axes.set_xlabel('z2')
axes.set_ylabel('a2')
df = df.sort_values(by=['z3'])
sns.lmplot(x="z3", y="a3", data=df, hue="label", fit_reg=False, size=4, aspect=2)
fig, axes = plt.subplots()
axes.plot(df["z3"], df["a3"])
axes.set_xlabel('z3')
axes.set_ylabel('a3')
df = df.sort_values(by=['z_out'])
sns.lmplot(x="z_out", y="tmp", data=df, hue="label", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="z_out", y="a_out", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
df = df.sort_values(by=['x1x2_sum'])
sns.lmplot(x="x1x2_sum", y="z1", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="a1", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="z2", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="a2", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="z3", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="a3", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="z_out", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
sns.lmplot(x="x1x2_sum", y="a_out", data=df, hue="prediction", fit_reg=False, size=4, aspect=2)
And here is your homework: try to draw a line of data points where the points in the middle are orange and the points on the left and on the right are blue. Then make all the steps needed (which are fully explained above) to "train" a neural network using random search to separate these points. Or do it manually (it shouldn't be that hard to come up with appropriate parameters). Or, if you know how backpropagation works, try to implement the whole process with backpropagation and see visually how everything works.
After that, try other activation functions: maybe relu, elu, selu or any other function you know that works with backpropagation. Also try a linear function instead of a non-linear one (basically, just remove the activation calculation phase from each neuron except the one in the output layer) and see what happens.
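For the relu variant of that homework, only the hidden-layer activation changes; the output neuron can keep its sigmoid so a_out stays a probability-like value. A minimal sketch, with made-up placeholder weights and inputs rather than trained values:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up parameters just to make the sketch runnable
w1, b1 = np.array([0.5, -0.3]), 0.1
w2, b2 = np.array([-0.4, 0.6]), -0.2
w3, b3 = np.array([0.7, 0.2]), 0.0
w_out, b_out = np.array([1.0, -1.5, 0.8]), 0.3

x1, x2 = np.array([1.0, -2.0]), np.array([0.5, 3.0])  # two sample points

# hidden layer now uses relu instead of expit
a1 = relu(w1[0] * x1 + w1[1] * x2 + b1)
a2 = relu(w2[0] * x1 + w2[1] * x2 + b2)
a3 = relu(w3[0] * x1 + w3[1] * x2 + b3)

# output neuron keeps the sigmoid, so thresholding at 0.5 still works
a_out = sigmoid(w_out[0] * a1 + w_out[1] * a2 + w_out[2] * a3 + b_out)
prediction = np.where(a_out >= 0.5, 1, 0)
print(a_out, prediction)
```

With relu hidden units the resulting 3D surface is piecewise-planar rather than smoothly curved, which is interesting to compare against the sigmoid plots above.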
Wow, if you made it this far - congratulations! This concludes our journey.
I hope it was useful and interesting for you and that it made your understanding of neural networks (much) better!
Cheers & good luck with AI! :-)